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Abstract — We study the problem of optimal traffic pre- 
diction and monitoring in large-scale networks. Our goal is 
to determine which subset of K links to monitor in order 
to "best" predict the traffic on the remaining links in the 
network. We consider several optimality criteria. This can 
be formulated as a combinatorial optimization problem, 
belonging to the family of subset selection problems. Similar 
NP-hard problems arise in statistics, machine learning and 
signal processing. Some include subset selection for regres- 
sion, variable selection, and sparse approximation. Exact 
solutions are computationally prohibitive. We present both 
new heuristics as well as new efficient algorithms imple- 
menting the classical greedy heuristic - commonly used to 
tackle such combinatorial problems. Our approach exploits 
connections to principal component analysis (PC A), and 
yields new types of performance lower bounds which do not 
require submodularity of the objective functions. We show 
that an ensemble method applied to our new randomized 
heuristic algorithm, often outperforms the classical greedy 
heuristic in practice. We evaluate our algorithms under 
several large-scale networks, including real life networks. 

I. Optimal Network Monitoring 

Consider a (communication, supply, or transportation) 
network consisting of N nodes and L links. Traffic 
is routed over the network along predefined sets of 
routes described by the matrix A — (a^j)L X j, with 
ci£.j = 1, when route j uses link £ and otherwise. 
Typically the total number of routes J = N(N — 1) 
equals the number of all possible (ordered) pairs of 
source and destination node, i.e. (s,d) pairs. We shall 
ignore network delays and adopt the assumption of 
instantaneous propagation. This states that for each 
route j, the traffic propagates instantaneously. This is 
reasonable when traffic is monitored at a time-scale 
coarser than the round-trip time (RTT) of the network. 
Under this assumption, we obtain that the traffic loads on 
all links and all routes in the network are related through 
the fundamental routing equation: 

y(i)=^x(t). (1) 

Here y(t) = (ye(i))f = i and x(t) = (xj(t))j =1 represent 
the vectors of traffic loads on all links and all routes in 
the network, respectively. 

We consider several design problems from the per- 
spective of network prediction (known also as kriging). 



In a static context, the traffic loads x — {x(t)}t£if 
follow a stationary time series with certain marginal 
mean vector fi x and co variance matrix E x . The temporal 
dependence here will not be taken into consideration. 
Consequently by ([TJ, the traffic loads on all links in 
the network y = {y(t)} t ^ form a stationary time 
series with mean fi y = Afi x and covariance matrix 

In a dynamic context, in addition to the information 
about the marginal distributions of the traffic loads x(t) 
and y(t) a concrete temporal dependence structure will 
be imposed or derived from a spatio-temporal traffic 
model. In this paper, we concentrate only on the static 
context. Therefore, from hereafter we drop the depen- 
dence on time for notational convenience. 

In the network prediction problem, sometimes referred 
as kriging, one observes the traffic loads y Q = (ye)eeo 
on a set of links O C C := {1, • • • ,L}. The goal is 
to predict the traffic loads y u = (yi)igu, where U := 
{1, • • • ,L}\0. Depending on the available information, 
one can apply the classical standard, ordinary or univer- 
sal kriging approaches from spatial statistics ([ 1 1). With- 
out loss of generality, we consider the above design prob- 
lem in the simplest possible case where x ~ M(fi x , Ij) 
is a standard multivariate normal random vector with 
known mean fj, x . Therefore!! y ~ Af(fJ, y ,AA T ), where 
fiy = Afi x . Hence, the traffic means fi y and covariances 
Yi y are known. Then ordinary kriging provides the best 
linear unbiased predictors (BLUP) for y u via y a : 

Vu = M« +£« £~ 1 (y - Mo), (2) 

where 

Mo) ^ S *=Uo« E~J 
are the partitions of the mean and the covariance into 
blocks corresponding to the unobserved (it) and observed 
(o) links. 

1 Note, that in the general case when T, y = ATi x A T we 
can express the covariance (hence, positive definite) matrix T, x as 
T, x = T,l /2 T,l /2 . This gives £„ = At}J 2 y}J 2 A T = AA T with 

1/2 

A = AH X being the matrix imposed to be used in the following 
analysis. 



If no information about the mean is available, then 
ordinary kriging may be applied where the traffic means 
across different links are assumed to be equal to an 
unknown constant. Alternatively, if additional network- 
specific information is available, a universal kriging 
type approach based on modeling the network-wide 
means can be adopted. The model can be fit by using 
measurements at a few links and then used for prediction. 
See HI for more details. 

In the dynamic context, different sets of links may 
be observed at different times, e.g. in a 'round robin' 
fashion. Then, in the context of a judicious model for the 
spatio-temporal dependence over the network, one can 
formulate and address various network kriging problems. 

Consider the static context outlined earlier and sup- 
pose that the traffic means fi y and covariances S y 
are known. Thus, the ordinary kriging estimator in (f2]i 
provides the BLUP for y u via y a . 

Problem 1 (monitoring design). Given that no more than 
K = \0\ links can be observed, determine the 'optimal' 
collection of links to monitor O. Different 'optimality' 
criteria are possible. Let S err := K(y u — y u )(y u — y u ) T , 
be the prediction mean squared error matrix, where 
y u := E(y u \y ). Determine O, which minimizes: 

. trace(Serr) := £ ieW Var(y n |y ) = E\\y u - y u \\ 2 , 
where \\ ■ \\ stands for the Euclidean norm. 

• || S e rr || 2, where || • || 2 stands for the spectral matrix 
norm, i.e., largest eigenvalue of S crr , henceforth 
written as p(S CIT ). 

The optimization formulation for our optimal moni- 
toring design problem is to find the optimal set O* such 
that: 

Z(0*) = mmf(0) 

s.t. \0\ = K, (3) 

where C — {1,2, ...,£} is the set of network links, 
Z(0) is the prediction error when monitoring links in 
set O C C and function /(■) is the optimality criterion 
(i.e., trace or spectral norm). For example, for the trace 
case, f{0) := tracc(S crr (£>)). 

In experimental design lingo, the former criterion is 
known as A-optimality and the latter as E- optimality . 
Another classical criterion encountered in experimental 
design is D-optimality, defined as det(Sg tt ), but for 
space considerations we will not consider it in this 
paper. The motivation behind considering many differ- 
ent criteria comes from applications. For example, one 
natural criterion is the trace of the error covariance 
matrix, which equals the total sum of the variance of 
the estimates of the unobserved links. Another point of 
view involves the minimal volume confidence sets for y u . 
These correspond to level-sets of the conditional density 



function. E.g., under normality, we have 

Vu\Vo ~ N'iVu, Sim - SuoSj/Sou), (4) 

where y u is given by (ffji. Therefore, the minimum 
volume confidence set for y u in this case is of the form 

Uu + {yeR d : y T S-.iy < c} 

for some constant c > and d := \U\. 

It may be of interest to find a design (i.e. a set O) that 
minimizes for example the diameter, the volume or some 
other characteristic of the above confidence set, given a 
fixed confidence level a. This gives rise to the E-optimal 
and D-optimal criteria, respectively. 

The optimal monitoring design problem (f3]) is NP- 
hard, in general. The problem can be reduced to the 
NP-complete Set Cover problem. An alternative reduc- 
tion and proof for NP-hardness is given in Q. The 
complexity of the covariance matrix S y correspond- 
ing to real, large-scale networks makes this problem 
a computationally intractable problem in practice. An 
exhaustive search implementation in CPLEX ran on 
our institution's cluster required more than two days 
to converge to the optimal solution for a small-scale 
network of 9 nodes and 26 links. Other methods, such as 
the mixed-integer program with constraint linearization 
and the branch-and-bound algorithm presented in [3] 
that require submodularity for the objective function, 
are also computationally expensive and impractical for 
even moderate-scale networks. Furthermore, even though 
the trace objective function can be submodular (see [4J, 
[2]), other criteria of interest such as E-optimality and 
D-optimality are not. These challenges show the need 
for approximate but fast heuristic algorithms for solving 
optimization problem (0. 

A well-known algorithm to quickly find approximate 
solutions to Problem (Q~|i is to perform the following 
forward selection procedure. Start from the empty set, 
and add elements one at a time, adding at each step the 
link that minimizes the prediction error Z the most (or 
equivalently, maximizes the error reduction the most). 
We call this, the greedy heuristic. Let 6j(0) = Z(0) — 
Z((D |J{i}) be the error reduction when adding element 
j to the set O. Below, we give its formal description as 
described by Nemhauser et al. 0. 

Greedy Heuristic. The steps of the algorithm are: 

1) Let 0° = 0, J\f° = C and set k = 1. 

2) At iteration k, select ik G N k ~ 1 such that 

i k G argmax5 i (O fc " 1 ) (5) 

with ties settled arbitrarily. 

3) Set S k -! = 8 ik (p h - x ). if 4-i < 0, stop. 
The selected set is C fe_1 . Otherwise, set O k — 
O k - x \J{i k } and N k - AT"- 1 \ {i k }. 



4) If k = K stop and output O . Otherwise, set 
k = k + 1 and go to Step 2 ). 

In this paper, we propose a new heuristic for Prob- 
lem (Q]) as well as new, efficient implementations of 
the classical greedy heuristic. Notably, the naive im- 
plementation of the greedy heuristic is not feasible 
for large-scale networks since it involves (in our case) 
the inversion of high-dimensional matrices. We exploit 
the natural geometric structure in the problem, which 
is related to principal component analysis (PCA) (6). 
This structure motivates new, projection-based heuristics 
and also leads to exact, efficient implementation of the 
classical greedy heuristic, that avoids matrix inversions. 
The connections to PCA also yield lower bounds on the 
error reduction in Problem (|T), which are of independent 
interest. The efficiency of our algorithms allows us 
to apply them to very large-scale networks. By using 
multiple, parallelized executions of randomized versions 
of our algorithms, we arrive at non-greedy solutions 
through the method of ensemble. For small and medium- 
scale networks this can often come close and, in fact, 
match the optimal solution of Problem ((TJ. 

The combinatorial nature of the optimal monitoring 
design belongs to the family of subset selection prob- 
lems, and is encountered in scientific areas ranging from 
computer science and engineering to statistics and linear 
algebra. In [3] a bank location problem is formulated as 
an interger program and algorithms based on branch and 
bound are employed to retrieve the exact solution. The 
problem, however, is NP-hard so exact algorithms are 
computationally inefficient. Hence, in [5| approximation 
algorithms are studied that exploit the submodularity 
property of the objective function. (In words, submodu- 
larity is for set functions the analog of what convexity 
is for real number functions.) Polynomial-time, greedy 
heuristics along the lines of the one shown above are 
offered that yield a solution within (1 — 1/e) of the 
optimum, where e is the base of natural logarithm. 

In [2j, near-optimal sensor placement for temperature 
monitoring is examined. Temperatures at different lo- 
cations are modeled with a multivariate Gaussian joint 
distribution. The objective is to place a small num- 
ber of sensors so as to optimally predict the temper- 
ature at other locations. The objective is to maximize 
the information-theoretic mutual information criterion. 
Apart from the criterion used, the problem is very similar 
to ours. The authors exploit submodularity and propose 
heuristics that outperform the classical implementation 
of the greedy algorithm. However, these heuristics are 
faster than classical greedy only in special cases such 
as when the covariance matrix of the joint Gaussian 
distribution has "low-bandwidth0". The structure of the 

2 A covariance matrix £ has bandwidth f) when the variables can be 
ordered such that T,ij = when \j — i\ > /3. 



covariance matrix is also exploited in [7| where the 
problem of subset selection for regression is studied. 
The setup includes a large number of variables Xi that 
can be observed, and the interest is on the value of 
the predictor variable z. The problem is to select the 
fc-subset of variables Xi that "best" predict z in the 
sense of minimizing the mean squared prediction error. 
This corresponds to Problem Q] under the trace criterion 
in the special case of searching for the subset of links 
(variables Xi) that predict a specific link (variable z). The 
authors propose fast, exact algorithms based on dynamic 
programming. These algorithms may be employed only 
in the special cases of "low-bandwidth" and "tree covari- 
ance" graphs. Unfortunately, such special cases do not 
arise very often in our problem due to complex network 
structures and heterogenious traffic sources. Thus, the 
algorithms in ||2|, IF71 are not practical for real- world 
instances of our optimal monitoring problem. 

In mathematics and signal processing community the, 
so called, sparse approximation problem is also similar 
to ([3J. In that context, the goal is to find a "sparse" 
subset of a dictionary V = {4>i]ie{i,...,\v\} of unit 
vectors that span 1^, so that a linear combination of 
the selected vectors best approximates a given signal 
A £ R*. Two common approaches, namely matching- 
pursuit and othogonal matching pursuit are discussed 
in (HI and a new heuristic with an exponential speedup 
over those two approaches is proposed. 

Another category of subset selection combinatorial 
problems is the problem of variable selection |9), iflOl . 
which seeks the "best" subset of exactly K columns 
from a m x n matrix A. The problem has been ex- 
tensively studied by the linear algebra and statistics 
community. One commonly used technique, namely the 
Lasso method ifTTI . relaxes the constraint on the number 
of variables sampled by using an alternate one on the 
regression coefficients vector. By doing so, the integrality 
constraint vanishes, and the problem can then be solved 
using standard quadratic programming algorithms. 

The remainder of the paper is organized as follows: 
in Section [IT] we describe, analyse and give the intuition 
behind our approximation algorithms. In addition, we 
discuss the PCA-based lower bound on the perfromance 
of our heuristics, and elaborate on the idea of ensem- 
bling for enhancing the performance of our algorithms. 
We evaluate our methods in Section [HI] in a variety 
of scenaria, including very-large scale and real-world 
networks. We draw our conclusions in Section HVl 

II. Approximation Algorithms 

A naive implementation of the described greedy al- 
gorithm in both A- and E-optimality contexts has a 
complexity of 0(KLn 3 ) where n ~ max{L, J}. This 
is because a matrix inversion (0(n 3 ) computations) is 
involved whenever 8i(O k ~ 1 ) is calculated. This section 



presents fast, approximation algorithms that substantially 
reduce the computational complexity for solving Prob- 
lem CD- 

Let Sj, = AA T be the covariance matrix for the model 
y = Ax. The error covariance matrix from © (in- 
cluding both the observed and unobserved links) equals, 
E err (0) = E - EoE-^, whereS = AA T a , E 00 = 
A Aq, and the partition matrix = (aij)i^o,j&C- One 
can also write the latter matrix as follows: 



<{0)=A{Ij-A t {A A t \ 
= A(Ij - P )A T , 



1 A )A T (6) 



where P a := A T {A A T )~ X A Q . The matrix Fo is the 
projection matrix onto the space Wo '■— Range(A^), 
i.e., the space spanned by the column vectors {a^ : G 
R J , i G O} of A T , where O is the set of the observed 
links. Therefore, (Jj — Pq) is the projection matrix onto 
the orthogonal complement Range(A^) ± . 

One can show that \\A-AP \\ 2 F = tr[A(Ij-P )A T ] 
and \\A - AP \\j = p[A(Ij - Po)A T \. Consequently, 
Problem (fl~|i is no different than the combinatorial prob- 
lem of finding the set O such that: 



O := argmin \\A 

\0\=K 



APc 



o 



(7) 



where || • ||^ is the spectral norm when £ = 2 and the 
Frobenius norm when £ = F, 

In a geometric perspective, (0 seeks the "optimal" 
space Wo '■= Range (A^) such that the distances, 
under the Frobenius^ or spectral norm, of the column 
vectors € R J , i G £ to the space in question are 
minimized. A known result from PC A (see lfl2l . Jllfl 
and Appendix [A} is that the optimal space W-pcaO is 
given by P© = VkVk where Vk = vi, • • • , «k) is 
the matrix of the top right singular values of A, or 
equivalently, the K principal eigenvectors (components) 
of A 1 A, This geometric observation gives a lower bound 
for the error and also suggests geometric heuristics. The 
spaces Wo in © should be spanned by the vectors 
a,i,i G C corresponding to the K observed links, 
so in principle, the PCA-based lower bound cannot 
be achieved. However, we can think of a sequential 
"greedy" method that picks the space Wo that has 
smallest "angle" with the space Wp CA . 

One idea is to first pick a link ii G C for which the 
vector a il is "closest" to the first PCA component v\. If 
K = 1, by © this would be the optimal choice where 
the notions of "close" and "optimal" depend on the type 
of norm or optimality criterion. If K > 1, we "subtract" 

3 The notion of distance becomes more intuitively clear in the 
Frobenius case. As shown in Appendix fA] it takes the form of the 
Euclidean distance. I.e., trace(S orr (C')) = 2^ =1 ||(-f — Po) a l l| 2 = 

4 Note that this means that we relax the constraint that wants the 
space to be spanned by vectors of A T . 



the effect of the chosen link by considering only 
the orthogonal projections of the remaining links onto 
the space spanja^j^. Now we focus on a new matrix 
A consisting of the above projections and repeat the 
procedure iteratively K times. 

Formally, the matrix A is updated as follows. Assume 
the iterative procedure selects link i& at step k to be 
added to the set of selected links O. We update A with 
the rule 



A (k) =A (k-i) fj 



(fc-i) (fe-i) J 

ll„( fe - 1 )||2 



(8) 



with A^ = A. The following proposition shows that 
the prediction error covariance matrix can be updated 
sequentially by using projections. This sequential pro- 
jection feature is the key behind the computational effi- 
ciency of the proposed heuristics presented in the sequel 
since it allows us to avoid expensive operations such as 
matrix inversions. Its proof is given in Appendix iBl 

Proposition 1. Let O := Ok be the set of selected links 
at the end of step k. Then 

A{k ) A ( k r = Scrr(0) = A(Ij _ A^A^y^A^, 

(9) 

with A^ updated as shown in Eq. (0. 
A. A-optimality (trace) criterion 

In our first heuristic, we implement the geometric idea 
discussed above in which, at step k, we select the link 
ik G C whose vector is closest to the principal 
component of the current version of A. We can choose 
between two options for vector proximity; the smallest 
angle or the longest projection. We decided to work with 
the latter. We obtain the principal component using the 
power method [12|. The steps of the algorithm are: 

Algorithm 1 (PCA Projection Heuristic - PCAPH). Let 

O = and = A Set k = 1. 

1) Find the principal component (i.e., the principal 
eigenvector), namely Vi of A^^ 1 ^ A^ 1 ^. Make 
use of the power method (see HTS ) to obtain a 
fast approximation o/vi. 

2) At iteration k, choose: 

^fc G argmax \v 1 a> '\ 



where a\ k 1] G R J ,i G {1,2,..., L} are the 
column vectors of A( k ~^ . Put ik in the list of 
links to be monitored, i.e., Ok = Ofe-i U{**;}- 
3) Consider the orthogonal projection of the columns 
of A( k ~^ onto the orthogonal complement of 
spanja*}, where a* := a\ l \ Thus, obtaining 



iec\o k -i 



matrix A^ k \ as follows: 
A {k) _ A (k-1) 



a a 



(10) 



4) Set k = k + 1. If k < K, go to step 1). 

One can show that the error reduction at each step is 
equal to: R(k) = £ t L =1 \\P k (a\ 



| 2 , where P k is the 

projection matrix to span{a|^ This heuristic does 

not always pick the link that reduces the error the most. 
Instead, at each iteration, it picks a vector (link) that 
is closest to the PCA principal component. As shown 
in Section Hill this strategy usually yields a monitoring 
design with slightly larger prediction error than the 
greedy strategy, but is orders of magnitude faster in 
execution time. Step 1) requires 0(mJL) computations, 
where m is the number of iterations the power methods 
needs. We need O(LJ) computations per iteration for the 
matrix-vector multiplication in the power method loop. 
Steps 2) and 3) could be done in 0(LJ) operations. 

Lemma 1. The complexity of the PCA Projection 
Heuristic is 0{mKLJ). 

One may try instead to pick row vectors of A, 
...,a,i K , such that a,i k is "closest" to the fcth prin- 
cipal component direction. This heuristic, however, does 
not yield good results for large K due to accumulation of 
errors and since the PCA provides only a lower bound. 

The next algorithm, is a fast implementation of the 
classical greedy algorithm. It avoids calculating the 
inverse of the covariance matrix S OQ . Instead, at each 
iteration we seek the column vector that maximizes the 
error reduction the most. This is equivalent to finding 
the vector that maximizes the squares of the projections 
of the remaining vectors onto itself. This step is accom- 
plished by the first step of the algorithm. The second 
step, updates the matrix A^ k \ 

Algorithm 2 (Fast Greedy Exact - FGE). Let O = 

and = A. Set k = 1. 
1) At iteration k, choose: 



i k € argmax 



iec\c 
.(fc-i) 



.( fe -!)| 



(ID 



where af ±J £ R J ,i e {1,2,...,L} are the 

column vectors of A^ -1 ' . O k = Cfc-i U{*fc}- 
2) Do Step 3) of Algorithm^ i.e., 



A (k) =A (k-l)( T 



with a* := a 



(fe-i) 



(12) 



3) Set k = k + 1. If k < K, go to step I). 



Step 1) is a "greedy step" since it seeks to pick 
the link that reduces the error the most. It requires 
0(JL 2 ) operations. Step 2) requires O(LJ) operations 
after suitably rearranging the order of operations. The 
following lemma holds. 

Lemma 2. The computational complexity of the Fast 
Greedy Exact algorithm is 0(K J L 2 ). 



B. E-optimality (spectral norm) criterion 

For E-optimality, we present two fast implementations 
of the greedy heuristic. Unlike the classical greedy 
heuristic, relation © and Proposition [TJ allow us to 
avoid computationally expensive operations like matrix 
inversion and singular value decomposition. Therefore, 
our two greedy heuristics perform drastically better than 
the naive implementation of greedy. In addition to these 
two heuristics, we emphasize that a similar heuristic 
as Algorithm [TJ could also be implemented. For space 
considerations, we skip this description (since it is ex- 
actly the same as the one presented for the trace case), 
and present two new greedy algorithms that demonstrate 
how the aforementioned expensive operations could be 
avoided. Again, we utilize the power method to lead us 
to the largest eigenvalue of the error covariance matrix. 

Algorithm 3 (Fast Greedy Power - FGP). Let O = 

and A^ — A. Set k = 1. The steps of the algorithm 
are: 



1) At iteration k, choose: 



i k e argmin || S err || 2 , 
ie£\Ofc-i 



(13) 



E^- 1} = A^)(I.j 

Ak-X) 



(fc-1) (fc-l) J 



anda\ £ M' 7 , i € {1, 2, . . . , L} are the column 
vectors of A^^ 1 ^ . Use the power method to ob- 
tain the principal eigenvector of S crr . The largest 
eigenvalue, namely Ai, follows from its definition 

(i.e., Ax = vf ScrrVlj. O k = O k -i \J{i k }. 

2) Do Step 3) of Algorithm^ 

3) Set k = k + 1. If k < K, go to step 1). 

The computational complexity is analyzed next. In 
Step 1), note that the calculation of the error covariance 
matrix, £ C m can be simplified as follows (we symbolize 
^(fc-i) s j m piy as a due to space limitation): 



^- 1 H i ) = A(l J 



(fc-i) (fc-i) J 

||„(*-1)||2 



A 1 



(fc-l)w Jk-l) J 



AA 



(^ar iJ )(a 



A T ) 



,( fc -!)| 



(14) 



To find i k in Setp 1), we sequentially search all links 
i G C, to find the one that minimizes the error the 
most. The first term of ( fT~4-b costs 0(JL 2 ) operations 
and needs to be calculated only once per iteration k 
since is independent of the link choice. The second term 
requires O(JL) operations. The power method is run 
m times and requires O(LJ) operations. Thus, Step 1) 
requires 0(mL 2 J) operations in total. In Step 2), costs 
O(JL) operations. 

Lemma 3. The complexity of the Fast Greedy Power 
algorithm is 0(mKL 2 J). 



The second algorithm for the spectral norm criterion is 
also based on relation (O and Proposition Q] Moreover, 
we utilize the definition of the largest eigenvalue: 

p(E crr ) :— Ai(Sorr) := max a; T E orr a;. (15) 

»eR i ,||x||=i 

Algorithm 4 (Fast Greedy Randomized - FGR). Let 

O = and = A Set k = 1. 

1) Generate m independent, normally distributed 
random vectors x, £ R L , i = 1,2, ... ,m from 
N(0, 1) and set X{ := a;^/ ||as-i || 2- 

2) Af iteration k, k — 1,2, . . . , K, calculate: 

Cl := (xf A^" 1 )) • (A^ 1 ) Vi = 1, 2, . . . , m. 

(16) 

3) At iteration k, select: 

Tu(fc-l)u(fc-l) T 
. f xfb) >b) X; i 

j fc G argmin ^ max Cj jTzry— f' 



where b 



(fc-i) 



(17) 



R J ,j G {1,2, ...,L} are f/ze column vec- 
tors of A^ k ~^> . This corresponds to finding 
the link j that minimizes the error \\A^ k '(I — 
a i aJ/||a,-|| 2 )AW T || 2 . O k = O k - X {j{i k }. 

4) Do Step 3) of AlgorithmU] 

5) Set k = k + 1. If k < K, go to step 2). 

First, we randomly sample m unit vectors Xi : x; G 
R L , i = 1, 2, . . . ,m, ||£Cj|| = 1. We use these vectors to 
approximate the largest eigenvalue. Note, that in Step 
2) we save the a values so as to omit unnecessary 
repetitions of the same quantity. We multiply the terms in 
parentheses first, to avoid the expensive matrix by matrix 
multiplication. In Step 3), we choose the vector (link) 
that minimizes the error the most. Finally, we update 
matrix A for use in the next iteration. Step 2) requires 
0(mLJ) operations and Step 3) 0{mLJL). 

Lemma 4. The computational complexity of the Fast 
Greedy Randomized algorithm is 0(mKL 2 J). 

C. Ensemble method for randomized algorithms 

Algorithms PCAPH, FGP and FGR can be character- 
ized as randomized algorithms. PCAPH and FGP require 
the usage of the power method which utilizes a random 
vector for its initial operation. FGR uses m random 
vector to appoximate the largest eigenvalue. The main 
idea of the ensemble method is to pick a small m (since 
by Lemmas 1,3 and 4 m affects the running time of 
the algorithms) and start several, say r, independent and 
parallel instances of the algorithms. Eventually, we keep 
the solution set that yields the minimum prediction error. 
By using multiple, parallelized executions of randomized 
versions of our algorithms, we arrive at non-greedy 
solutions. For small and medium-scale networks this can 



often come close and, in fact, match the optimal solution 
of Problem (fl~|i (see Figure |l(c)| i. 

III. Performance Evaluation 

A. Greedy Vs. Exact algorithms 

We use the real-world network, namely Internet2, to 
evaluate the solution that greedy heuristics give against 
the exact solution obtained via exhaustive search. We 
also calculate and demonstrate the PCA lower bound. 
Interne t2 involves a network with L = 26 links and 
J = 72 flows (see [13]). We assume flows with co- 
variance matrix S x = Ij. In Figure |l(a)| we examine 
the trace criterion and in Figure [T(b)| the spectral norm 
one. The exhaustive search required several hours to 
converge to a solution (the implementation was done 
in Matlab using CPLEX). On the contrary, the greedy 
algorithms terminate in seconds and the prediction error 
at the solution set they converge to is very close to 
the minimum prediction error obtained by the exact 
algorithm. Observe also that the prediction error of the 
PCA projection heuristic is only marginally different 
from the one we get with the greedy algorithm. 

Ensembling is depicted in Figure |l(c)| Here we com- 
pare the FGR algorithm against the classical greedy. We 
verify that running r = 128 parallel implememtations 
of FGR with just m = 20 computationally outperforms 
classical greedy and yields solutions with prediction 
error closer to the exact solution. 

B. Comparisons of Approximation Algorithms 

In this section we juxtapose the computational perfor- 
mance of our fast approximation algorithms against the 
classical greedy heuristic. We run our simulations on a 
moderate network of N = 100 nodes generated using 
preferential attachment, as described in [14]. We assume 
J source-destination (s,d) flows with identical traffic, 
such that Xj ~ Af(/i, l),j G {1, 2, J}. The route for 
each (s, d) pair is chosen using shortest-path routing. We 
created a network with a routing matrix A of L = 195 
links and J — 500 flows. 

Figure [2] illustrates the results for A-optimality. Our 
proposed heuristics are notably faster than the naive 
implementation of the greedy algorithm. Our PCA pro- 
jection heuristic significantly outperforms all other al- 
gorithms at the expense of having a slightly larger pre- 
diction error. Figure [3] depicts the case of E-optimality. 
Again, our algorithms perform faster than the naive im- 
plementation of the greedy heuristic. We used m = 100 
iterations for the power method of the FGP algorithm and 
m = 100 random vectors for the FGR algorithm. The 
difference in the prediction error is due to the approxi- 
mation techniques used for the calculation of the largest 
eigenvalue. We use the PCA lower bound to assess the 
solution quality of the algorithms, since obtaining the 
exact solution is not computationally feasible. 
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Fig. 1. Evaluation of approximation algorithms on Internet2. 



Performance of approximation algorithms for the trace criterion 




PCA Projection Heuristic (m=100). Elapsed time (sees): 3.3 
Fast Greedy Exact. Elapsed time (sees): 20.0 
Classical greedy. Elapsed time (sees): 39.5 
PCA lower bound 



TABLE I 

Time elapsed for K=100, Inet topology. 
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Fig. 2. Prediction error for A-optimality on a network with 100 nodes. 
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Fig. 3. Prediction error for E-optimality on a network with 100 nodes. 



C. Scaling Up 

In this section, we evaluate our algorithms in a large- 
scale network constructed via the network generation 
tool Inet Q3) . The generated network has N — 3050 
nodes, and yields a matrix A with L = 4800 links and 
J = 1000 flows. The naive implementation of greedy 
was not executed for this scenario; based on its inferior 
performance observed earlier, such execution would be 
impractical. Indeed, several hours were needed even 
for the execution of the greedy heuristics FGE, FGP 
and FGR, that were shown to be much faster than the 
classical greedy. On the other hand, the PCA projection 
heuristic (ran with m = 100) terminated in less than 



PCAPH 
Time (sees) 
54 



FGE 
Time (sees) 
10017 



FGP 
Time (sees) 
26316 



FGR 
Time (sees) 
396 



a minute. This suggest that this heuristic consists an 
appealing, advantageous and robust option for extremely 
large networks. 

IV. Discussion 

Classical greedy algorithms are impractical for solving 
large-scale combinatorial problems like the one studied 
in this paper. Such problems arise in optimal network 
monitoring, variable selection and sparse approximation. 
In this paper, we present fast heuristics that outperform 
the classical greedy and are amenable for use in large- 
scale problems. Our approach can be also applied in the 
case of a weighted monitoring design problem (see Ap- 
pendix let. That is, links are assigned weights which rep- 
resent their significance; links with higher significance 
should be included in the monitoring set with higher 
priority than others. Our algorithms exploit connections 
to principal component analysis (PCA), and yields new 
types of performance lower bounds which do not re- 
quire submodularity of the objective functions. However, 
sobmodularity offers (1 — 1/ e) -approximation algorithms 
and, thus, our future plans include investigation of upper 
bounds on the performance of our algorithms. Moreover, 
we plan to examine how the routing structure imposed 
by matrix A appears in the optimal design under each 
criterion, and interpret the differences in the solutions 
obtained from the different optimality criteria. 

Appendix A 

Principal Components Analysis Lower Bounds 

Let Ti y = AA T be the covariance matrix for the model 
y = Ax. Let the singular value decomposition for A 
be A = UD 1 / 2 V T , with D 1 / 2 = diag(ai, a 2 , ■ ■ ■ , er p ), 
p = min{jL, J}, the singular values of A. Thus, Ai > 



• • • > At > A^ + i > • • • A p are the eigenvalues of the 
J x J matrix A T A Then the following theorems apply: 

Theorem 1 (Geometric Interpretation of PCA bound for 
trace). A standard result on the properties of the PCA 
implies that 

L J 

sub — spaccsw * — ' * — ' 
dim(W)=k e=1 j=k+l 

and the space W achieving the lower bound is spanned 
by the eigenvectors corresponding to the largest k eigen- 
values. 

Proof: Recall that the error covariance matrix can 
be written as 

S crr (0) = A(I - P )A T , 

where P := A^(AoA^)- 1 A Q . The vector (J - Po)a e 
is the perpendicular dropped from ag to Range(Aj), 
and hence — Po)a^|| is the distance from ag to 
Range(Ag ) = span(a<>, I e O), where A^ — {ai)i^o- 
The projection matrix (7 — Pq) is symmetric and 
indempotent (i.e., (I — Pq) = (I — Pq) 2 ), so we have 
Son- = [(/ - Po)A T ] T [{I ~ P )A T ], and therefore, 

L L 

trace(E err (0)) = ^ ||(/-P )a,|| 2 = ^ dist(o/, W Q ) 



Lemma 6. Using Theorem (ffj, f/ie following lower 
bound holds: 

p(S err (0)) = p[A(/-P )^ T ] > ||A-i4i\||l = Afc+i. 

(21) 

Proof: We first calculate the lower bound when 
Pk = V k V k T . We use the SVD of A, A = UD Y I 2 V T and 
the projection matrix I — P k = Udiag(0|\ lJ / _ k )V T . 
Thus, we have 

A - AP k = UD^ 2 V T Vdiag(0l, l T L „ k )V T 
= t/P^diag^, 1 
= [/diag(Ofc,cr fc+ i, . . 

By the definition of spectral norm: 

\\A-AP k \\ 2 2 =p((Udia,g(0l,a k+1 , 



L-k 



w 1 



...f/diag(O fc ,o- fe+ i,.. 
= p(Vdiag(o£,A fe+ i, 
= Afe + i. 



,°l)V t ) t - 
■,<r L )V T ) 
...,Xl)V t ) 



(18) 

where Wo '■= span(a^, I S O) is the sub-space 
spanned by the vectors at corresponding to the observed 
links. Using Proposition 1 of HI we see that the sum 
of ( TT~8T > is minimized when the projection matrix Pq 
equals P k = V k V k T , where the columns of the matrix 
V k are the top k right singular vectors of A, and the 
minimum value is ^2j=k+i ' 

Lemma 5. Using Theorem\l\ the following lower bound 
holds: 



Consequently, 

p(E clT (0)) = p [A(I - P )A T ] 

= p[(A-APo)(A-AP ) T ] 
= p [(A- AP ) T {A - APo)} 
> \\A-AP k \\l 

= A fe +i. (22) 



For the last inequality we used the result of Theorem (O 
and the lower bound obtained above. ■ 

Appendix B 
Proposition: identical error reduction 



Proof of Proposition^ We will show by induction 



that 



tr(E err (0)) = tx[A{I - P )A T ] 



A (k) A (k) T =A (jj 



> \\A-AP k \\%= Y; a, 



(19) 



i=k+l 



Induction Basis: For k 
following is true: 



o 

= 2 we will prove that the 



where P k = V k V k T 



A similar result holds for the spectral norm case. It 
is known from PCA analysis (see [12|, Theorem 2.5.3, 
and flU) that: 



„(i)„(D T 

a i 2 ^ :)A ar 



a 



Wi 



Theorem 2. 



mm 

rank(B)—k 



\A-Bh=\\A-AP, 



k 2 



(20) 



= A(I - Al{A AlY 1 A )A T =: 
where A^ = (0^,0^). For brevity, we define the 

n. : ' n. : ' 

projection P k as P k = I — 



where P k is the projection matrix P k = V k V k T . The 
columns of matrix V k are the top k right singular vectors 
of A, i.e$ P k = Udiag(l?\o£_ fe )V T . 

5 We symbolize the vector of ones of dimension k as 1^ and the 
vector of zeros as k . 



We have: 



A (1 \l 



(i) (i) 

ll«( 1 )||2 



)A d) T = a Wp 2 aW 



A^P 1 P 2 P[A^ T = AP X P 2 P^ 



It suffices to show that I - A^(A A^y 1 A = PiP 2 P 1 . 
Indeed, P X P 2 P X = P xi> a a>x = P pan{aii , a U> } x. We 
have used the identity 



similar manner we have a 



(i) 



*k + l 



implying span^^J = span{a ifc+1 }. Thus, 



P 



Xspan{ui,U2,-..,u r }- 1 - Pit^P-u 



P 



Spai^a^ ,a,i 2 

p. 



k l k + l' 



where U\, u 2 , . • • , u r are orthonormal vectors. Note 
that span{ai 1 ,a; '} — span{ai i; a i2 } because = 
di 2 - P ail (aj 2 ). Therefore, 

PlP2Pl = P S pa„ {ail ,a i2 K = 1 - AT o( A oA T Y 1 A 

by the definition of a projection matrix. This completes 
the proof of the basis of induction. 

Induction Hypothesis: Assume the following is true 
for step k: 

0-1) (fc-i) T 



a 



A(J-^(A^)-M )A 5 



A(I 

' I j , • • • ; a i fc )• 



which means that A^A^ 
A T {A A T )~ 1 A )A T for A T Q = (a n ,a,, 

Induction Step: Show that for iteration k + 1 we have: 

(fe) (fc) T 

where A T a = (a^ , a* 2 , . . . , a ih , a ik+1 ). 
Proof of Induction Step: 



A^ k \l- 



(k) (fc) 
a) ' a) ' 



,(fe) 



II II 

= APiP 2 . . . P k P k+l P k . . . P 2 PiA T 

From the induction hypothesis we have: 

P 1 P 2 ...P k ... P 2 P X =1- A T (A A T )- X A 
^P x P 2 ...P k =I- A T o{k) {A o{k) A T o{k) )- l A o{k) 

where o(k) is the set of the links selected by the 
greedy heuristic up to step k. Hence, A^^ = 

, dj 2 , . . . , CLi k )- 

Also, we have 

P\P 2 . . . P k P k +l = -Pspanjaij ,a i; 



P 



spanla^ ,ai. 



.aiJ-L-Tfc+i 
' k >k+l J 



With the help of geometry we can see that 

spanja^^} = span{ai fc+1 }. This is because 



l ik+l 



Ak-1) 
l ik + l 



which 



implies sparrfa^jj 
a; = a. 



- P (fc - 1 ,(a, (fe " 1) ) 

= span{a^~ j 1 ' ) }. Similarly, 

- P a(fc -2)(a|^ i 2) ) implies 

*k-l 

span{a^ +i lj } = span{a^ +i 2 ' ) }. Continuing in a 



span{ai 1 ,ai 2 ,...,a,i k ,ai fc ^ 1 1- 1 - 
= ^ _ A o(k+l)( A o(k+l) A o(k+l)) ly ^o(fc+l) 

and the induction step proof follows from that. ■ 

Appendix C 
Weighted Link Monitoring 

Recall that in the trace criterion we minimize: 

tr(E[(y - y){y - y) T ]) = E(tr[(y - y)(y - y) T ]) 
= E[\\y-y\\ 2 } 
= E[(y - y) T (y - y)]. 

(23) 

The weighted monitoring design problem aims to 
minimize E[(y — y) T G(y — y)}, where G := 
diag(wi, . . . ,wl) is the matrix assigning the link 
weights. After some algebra we get: 

E[(y-y) T G(y-y)} (24) 

= E[(G^y-G^yy{G^y- G ^y)] 

= E[(y G - y G y (y G - y G )], (25) 

where y G := G x l 2 y is the predictor for the random 
vector y G := G x / 2 y. This implies that if we apply 
our algorithms to the imposed "routing" matrix A G := 
G l t 2 A, we will obtain the optimal predictor for y G . The 
optimal predictor for the original vector of links, y, is 
simply obtained by y = G~ x l 2 y G . 
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